Perception as sampling with categorical representations. Part 3

Phonetics Perception Representations
Karthik Durvasula
2025-08-14

Background

We are back to our regular programming, folks!

In a couple of previous blog posts (Durvasula 2025b, 2025c), I discussed how perception can be viewed as repeated sampling with categorical representations. The view developed allowed us to explain/derive (not just account for) a few things.

First, the gradience observed in perceptual experimentation falls out as an averaging artefact (something already foreseen in Massaro and Cohen (1983)). Second, as a task becomes harder, the perceptual function becomes more linear. Third, reaction times are longer near the categorical boundary than away from it. Fourth, neural measures (such as the N100) are more linear than behavioural measures.

Furthermore, I presented two ways in which the sampling procedure could be envisioned: the first where the sampling was done in parallel, and the second where it was done serially. The current post delves into this particular issue of how best to envision the sampling procedure. The basic idea of a simultaneous parallel-serial approach was Spencer Caplan’s: when he implemented the ideas discussed in the other blog posts, he went in a different and interesting direction, and in subsequent discussions we realised that there are theoretical advantages to having such a system.

A parallel sampling procedure is one where the audio input is simultaneously sampled multiple times. The advantage of this system is that there is a way to hold on to indeterminacy in categorisation (or ambiguity in the signal) through such a process, as long as the decision threshold used for sampling has some stochasticity about it. That is, the initial representation of the signal itself can be in the form of categories if we have a parallel sampling procedure. The disadvantage of such a system is that we lose a mechanistic understanding of the reaction time results, which depend on some sort of serial search/sampling procedure (as I hinted in the first post).
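To make this concrete, here is a minimal sketch of my own (not part of the model code below), assuming a boundary centred at 30 with an SD of 5: for an ambiguous input, the simultaneous samples frequently disagree, so the ambiguity in the signal survives even though each individual sample is categorical.

# A minimal sketch of parallel sampling with a stochastic boundary
parallelParse = function(input, boundary = 30, boundarySD = 5, reps = 2) {
  # Each parallel sample gets its own noisy boundary
  noisyBoundaries = rnorm(reps, mean = boundary, sd = boundarySD)
  ifelse(input < noisyBoundaries, "Cat1", "Cat2")
}

# Proportion of parses (out of 1000) in which the two samples disagree
mixedParseRate = function(input) {
  mean(replicate(1000, length(unique(parallelParse(input))) > 1))
}

mixedParseRate(30)  # ambiguous input: the parse often contains both categories
mixedParseRate(10)  # unambiguous input: almost never mixed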

A serial sampling procedure is one where the audio input needs to be stored temporarily and then that representation is pinged repeatedly (until some sort of decision threshold is reached) to arrive at a percept. The disadvantage here is the need for another intermediate gradient/continuous representational format, which will have to be gradient if we are to preserve some information about the ambiguity in the signal. This isn’t a problem by itself, since we have a sort of gradient physical representation in the cochlea, but that representation is ephemeral, as it changes as the incoming sound changes. How sweet would it be to not use a gradient cognitive representation at all??!! However, the advantage of the serial sampling view is that we get a mechanistic understanding of the reaction time results because of the serial sampling procedure.

Now, all of the above was in the previous blog posts, but not highlighted. Spencer’s stroke of genius was to ask (perhaps not in these words exactly, so assume this is a rational reconstruction of what actually transpired :)): what if we combined both sampling procedures? What if the initial representation was parallel (to get a set of initial categorical representations), but the subsequent or secondary (behavioural) percept was based on serial sampling of the initial representation? Such a view maintains the advantages of both the parallel and serial approaches while chucking their disadvantages.

Now, the question is: what’s the minimal system we can get away with? It turns out that, at least for the extant data, we can get away with a set of just two initial parallel samples and a run of no more than four serial samples. Both are within the four-item limit for working memory that Cowan (2010) discussed. In what follows, I will generate most of the crucial results with such a system, and then compare it to actual data.

Here is the model in a nutshell: the initial percept of the audio input is in terms of N categorical segments (even just two is enough) — this term is called reps in the functions below. Then, there is repeated sampling from the initial set till you hit a sequence of M identical parses (a run of fewer than 5 is enough) — this term is called runSize in the functions below. That’s it, folks!

Intuitively, a few results should follow. First, the reaction times will depend on the proximity to the category boundary — same as the other models in the other blog post.

Second, tasks with a high memory load or more variability will have a shallower slope for the same reason as in the serial model in the other blog post.

Third, and crucially, each participant must have a categorical secondary parse — different from the models in the other blog post. So, the gradience observed arises through an averaging artefact over multiple trials and multiple participants. In essence, the new model rejects the claim of Massaro and Cohen (1983). Massaro and Cohen (1983)’s original argument/claim was that there is within-speaker gradience, and they used a rating scale to show this. However, here is their task: “Responses were recorded for 32 blocks of 7 stimuli, sampled randomly without replacement. Thus, a total of 896 ratings were collected for each subject, 128 measures for each of the 7 stimuli”. Their results are therefore compatible with a view where each individual percept of each stimulus is categorical, but the average over the percepts looks gradient — we are back to the possibility of an averaging artefact! And there is no reason from their study to assume that each percept is gradient. In fact, it should be obvious, given that every perceptual task is likely to have repetitions, that an averaging-artefact-based theory is always in play!
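Here is a quick sketch of my own (not a reanalysis of their data, and the per-stimulus percept probabilities are just assumed values): a simulated listener whose every single percept is strictly categorical (0 or 1), averaged over 128 repetitions per stimulus as in their design, still produces a smooth-looking mean response function.

# Averaging artefact sketch: categorical percepts, gradient-looking means
set.seed(1)
stimuli = 1:7                                # a 7-step continuum, as in their design
pCat2 = pnorm(stimuli, mean = 4, sd = 1)     # assumed probability of a Cat2 percept per step

# 128 strictly categorical (0/1) percepts per stimulus, then averaged
meanResponses = sapply(pCat2, function(p) mean(rbinom(128, size = 1, prob = p)))

plot(stimuli, meanResponses, type = "b",
     xlab = "Stimulus step", ylab = "Mean of 128 categorical percepts")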

Functions

First, let’s define the functions. Let’s define a perceiver that immediately categorises the input (into one of two categories) reps separate times, using a stochastic category boundary. This is the same function as in the previous blog posts.

# Packages used throughout the post (dplyr/tidyr/purrr/ggplot2 come with tidyverse;
# broom provides tidy() for the model fits further below)
library(tidyverse)
library(broom)

categoricalPerceiver = function(input, boundary, boundarySD, reps=2){
  
  # Getting the (stochastic) categorical boundaries, one per initial sample
  CategoricalBoundaries = rnorm(reps, mean=boundary, 
                                sd = boundarySD)
  
  #Creating a data.frame with repetition information.
  Results = data.frame(Reps = 1:reps)
  
  #Giving a percept for each repetition based on its sampled boundary
  Results %>% 
    mutate(Percept = ifelse(input<CategoricalBoundaries,"Cat1","Cat2"))
}

#Example output of the categorisation event (one row per initial sample)
# categoricalPerceiver(30, 30, 10, 5)

Next, let’s resample from the initial set of percepts to create a subsequent or secondary (behavioural) percept, stopping when there is a run of M identical percepts.

#Repeated sampling till a sequence of M identical percepts is obtained.

RepeatedSamplerTillRunAchieved = function(input=30,
                                          boundary, boundarySD,
                                          outputTotalCount=T,
                                          reps=2,
                                          runSize=4,
                                          PerceiverModel=categoricalPerceiver){
  
  # Getting the initial set of percepts (the parallel stage)
  initialParse = PerceiverModel(input,boundary,boundarySD,reps)$Percept
  
  #Initialising the counter and the resampling sequence, and then resampling
  counter=0
  resampledSequence = c()
  repeat{
    counter = counter + 1
    
    # Drawing one secondary percept from the initial set (the serial stage)
    resampledSequence[counter]=sample(x=initialParse,size=1,replace=T)
    
    # Was there a sufficiently long run? 
    # (there first need to be at least runSize samples)
    if(counter>=runSize){
      resampledSequenceRun = resampledSequence[(counter-(runSize-1)):counter]
      # If there is indeed a run of categories of the desired length, then stop
      if(sum(resampledSequenceRun[1]==resampledSequenceRun)==runSize){
        return(data.frame(TotalNumberOfIteration=counter,
                          FinalPercept=resampledSequenceRun[1]))
      }
    }
  }
}

# Example output
# RepeatedSamplerTillRunAchieved(input=5,boundary=30,boundarySD=100,reps=2,runSize=10)

We need one last function that averages across participants, to get the mean proportion of responses and the mean number of sampling iterations (the reaction-time proxy).

SubjectAverager = function(NumSubjects=30,input=30,
                           boundary, boundarySD,
                           reps=2,runSize=4,
                           PerceiverModel=categoricalPerceiver,
                           SecondaryPerceiverModel=RepeatedSamplerTillRunAchieved){
  
  SubjectSimulation=data.frame(Sub=1:NumSubjects) %>% 
    group_by(Sub) %>% 
    nest() %>% 
    # One secondary (behavioural) percept per simulated subject
    mutate(Results=SecondaryPerceiverModel(input,boundary,boundarySD,reps=reps,
                                           runSize=runSize,PerceiverModel=PerceiverModel)) %>% 
    select(-data) %>% 
    unnest()
  
  # Averaging the iteration counts (RT proxy) and the proportion of Cat2 percepts
  RTAvg = mean(SubjectSimulation$TotalNumberOfIteration)
  Cat2PerceptProp = sum(SubjectSimulation$FinalPercept=="Cat2")/nrow(SubjectSimulation)
    
  # Returning the info
  data.frame(Cat2PerceptProp,RTAvg)
}

# SubjectAverager(input=5,boundary=30,boundarySD=5,reps=2,runSize=4)

This new model will still get all the results that I discussed in previous blog posts. Let’s look at each case.

Simulations

Note, in all the results, I set reps (the number of initial parallel percepts) to 2, and runSize to 4. Incidentally, changing the runSize makes little difference to the categorisation function — this makes sense, since the initial percepts already track the function. But, for RTs, there needs to be a run. If there is only one initial percept (reps = 1), then every resampling will immediately result in a run, and there won’t be any variation in reaction times.
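As a quick check with the functions defined above (the particular input and boundary values here are just illustrative): with reps = 1, the first runSize resamples necessarily match, so the iteration count is always exactly runSize; with reps = 2, ambiguous inputs can take longer.

# With a single initial percept, the decision always takes exactly runSize iterations
replicate(5, RepeatedSamplerTillRunAchieved(input=28, boundary=30, boundarySD=5,
                                            reps=1, runSize=4)$TotalNumberOfIteration)

# With two initial percepts, ambiguous inputs can take longer before a run of 4 appears
replicate(5, RepeatedSamplerTillRunAchieved(input=28, boundary=30, boundarySD=5,
                                            reps=2, runSize=4)$TotalNumberOfIteration)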

The RT will depend on the proximity to the category boundary

The new model can of course derive the gradient-looking categorisation function and the relationship between reaction times and distance from the categorical boundary.

# Getting an input range
Results=data.frame(input=seq(0,60,by=3)) %>% 
  mutate(inputCopy=input) %>% 
  group_by(input) %>% 
  nest() %>% 
  mutate(Results=map(data,function(x=data){SubjectAverager(NumSubjects=30,input=x$inputCopy,
                boundary=30,boundarySD=5,reps=2,runSize=4)})) %>% 
  select(-data) %>% 
  unnest()

# Categorical function
Results %>% 
  ggplot(aes(input,Cat2PerceptProp)) +
  geom_point()+geom_line() +
  xlab("Input value") +
  ylab("Proportion of Cat2 responses")
# RTs
Results %>% 
  ggplot(aes(input,RTAvg)) +
  geom_point()+geom_line() +
  xlab("Input value") + 
  ylab("Average RT for decision")

High memory load will have a shallower slope — same as the other models in the blog post.

The model also correctly gets that, as the memory load increases (modelled as an increase in the standard deviation of the category boundary), the categorisation function becomes shallower.

#Creating a data.frame with input values for each memory load condition

LowMemoryLoad = data.frame(input=seq(0,60,by=3)) %>% 
  mutate(inputCopy=input) %>% 
  group_by(input) %>% 
  nest() %>% 
  mutate(Results=map(data,function(x=data){SubjectAverager(NumSubjects=60,input=x$inputCopy,
                boundary=30,boundarySD=5,reps=2,runSize=4)})) %>% 
  select(-data) %>% 
  unnest() %>% 
  mutate(LoadType = "Low memory load")

HighMemoryLoad = data.frame(input=seq(0,60,by=3)) %>% 
  mutate(inputCopy=input) %>% 
  group_by(input) %>% 
  nest() %>% 
  mutate(Results=map(data,function(x=data){SubjectAverager(NumSubjects=60,input=x$inputCopy,
                boundary=30,boundarySD=20,reps=2,runSize=4)})) %>% 
  select(-data) %>% 
  unnest() %>% 
  mutate(LoadType = "High memory load")
  
#Plot of averaged values
rbind(LowMemoryLoad,HighMemoryLoad) %>%
  ggplot(aes(x=input, y=Cat2PerceptProp,colour=LoadType)) +
  geom_point() +
  geom_line() +
  xlab("Input value") +
  ylab("Proportion of Cat2 percept")

High memory load will have longer reaction times

Something that I didn’t highlight in the previous blog posts is that all the models predict that higher memory loads will always lead to slower reaction times.

#Using the data.frames created above for the two memory load conditions

#Plot of averaged RT values
rbind(LowMemoryLoad,HighMemoryLoad) %>%
  ggplot(aes(x=input, y=RTAvg,colour=LoadType)) +
  geom_point() +
  geom_line() +
  xlab("Input value") +
  ylab("Average RT for decision")

Some evidence for the view

So far, I have shown you simulations and said these are the predictions. Here, finally, is some data from a recent paper by Caplan, Hafri, and Trueswell (2021). They ran a binary t~d identification task on stimuli varying in VOT, along with a variety of other conditions. They had a huge number of participants (N=154), so the dataset is a great way to establish some results without worrying about accidental patterns (at least, for the broad generalisations).

Here is more info about their stimuli and experiment: “consisting of 162 trials. Participants received new instructions telling them to press a button to decide whether the audio they heard was ta or da. The side of the screen on which the ta and da choices appeared was consistent within each participant but randomized between participants. On each trial, participants were exposed to audio of a CV syllable edited along a continuum between ta and da. After listening to the audio, participants were asked to judge whether the syllable contained t or d. The 162 test trials were divided between two exemplar ta/da tokens and nine VOT levels (20, 30, 40, 45, 50, 55, 60, 70, and 80 ms), with nine repetitions for each exemplar and level (2 × 9 × 9). The order of test items was randomized within a set of nine blocks such that every stimulus was heard once before it was repeated.”

A sample of their results is shown below, where I have suppressed a few columns along with the subjectid, just to make sure I can fit it on the page.

Data=read.csv("CHT_large_results_lastexp.csv") %>% 
  mutate(t.d_choice=as.factor(t.d_choice)) %>% 
  select(uniqueid,rt,VOT,t_choice,t.d_choice)

# length(unique(Data$uniqueid))
head(Data %>% select(-uniqueid))
       rt VOT t_choice t.d_choice
1 124.160  60        1          t
2 217.910  50        1          t
3 283.415  70        1          t
4 906.715  40        1          t
5 319.415  50        1          t
6 346.455  40        1          t
# Identification curve
Data %>% 
  group_by(VOT) %>%
  summarise(meanTProp = mean(t_choice)) %>%
  ggplot(aes(VOT,meanTProp))+
  geom_line() +
  ylab("Mean proportion of t response")

Reaction time data

As you can see below, the reaction times for the different VOT values have exactly the predicted shape. The median categorical boundary for the participants was 43 ms! You can judge for yourself how close the reaction-time peak is to that value!

# Plotting VOT and RTs
Data %>% 
  group_by(uniqueid) %>% 
  mutate(RTnorm=scale(rt)) %>% 
  ungroup() %>% 
  group_by(uniqueid,VOT) %>%
  summarise(MeanRTnorm=mean(RTnorm),
            MeanDResponse=mean(1-t_choice)) %>%
  ggplot(aes(VOT,MeanRTnorm)) +
  # geom_smooth(aes(y=MeanDResponse),se=F,method="loess")
  geom_smooth(se=F,method="loess") + 
  xlab("VOT (ms)")+
  ylab("Mean normalised RTs")

Variation in perception function slopes

Now, as I have ranted a few times (Durvasula 2024, 2025a), a lot of people make a big deal about variation observed in experiments and immediately leap to the conclusion that it is individual variation. In reality, even if all the participants had exactly the same generative system, we could still get the variation simply as a by-product of a certain amount of noise in the generative process that is identical across all the participants.
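Here is a small sketch of my own (with assumed parameter values, not fitted to anything) to make the point: every simulated participant has exactly the same underlying boundary and noise, yet the logistic slopes estimated from their responses still spread out, purely because of sampling noise over a finite number of trials.

# Identical generative systems, different estimated slopes
set.seed(1)
simulatedSlopes = replicate(50, {
  VOT = rep(seq(20, 80, by = 10), times = 18)   # assumed design: 18 responses per VOT step
  pT = pnorm(VOT, mean = 43, sd = 6)            # the same boundary and noise for everyone
  response = rbinom(length(VOT), size = 1, prob = pT)
  coef(glm(response ~ VOT, family = "binomial"))["VOT"]
})
summary(simulatedSlopes)  # a non-trivial spread despite identical underlying systems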

The same is true with perceptual data. McMurray (2022) makes a big deal of individual variation, but it is not clear that it is anything except random variation. Below are the individual slopes for each participant in the experiment. I fit some low-tech, simple logistic curves for each participant. OK¹, shoot me for not fitting mixed-effects models! I just wanted a quick-and-dirty sigmoidal curve for each participant.

# Fitting logistic regression models to each participant
# And then plotting the slope estimates
IndividualFits=Data %>% 
  group_by(uniqueid) %>% 
  nest() %>% 
  # Didn't use glmer cos I wasn't sure of the r.e. structure
  mutate(Results=map(data,function(x=data){tidy(glm(data=x,t.d_choice~VOT,family="binomial"))})) %>% 
  select(-data) %>% 
  unnest() 

# Plotting
IndividualFits %>% 
  filter(term=="VOT") %>% 
  filter(estimate<1) %>%  #Excluding one extreme slope estimate, just to see what the rest of the data looks like
  ggplot(aes(x=estimate))+
    geom_density()+ 
  xlab("Slope")+
  ylab("Distribution of slopes of the identification function")

Now, we have identified the intercept and the slope of the linear function that best fits each participant’s perceptual curve (in log-odds space). Since logistic regression fits a linear function to the log-odds transform of the probabilities, for each participant we have:

\(\log\left(\frac{p}{1-p}\right) = b_0 + b_1 x\)

By transforming the above equation, we can identify the categorical boundary (p=0.5):

\[\begin{align*} x &= \left(\log\left(\frac{0.5}{1-0.5}\right) - b_0\right)/b_1 \\ &= (0 - b_0)/b_1 \\ &= -b_0/b_1 \end{align*}\]
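For instance (with made-up numbers purely for illustration), if \(b_0 = -8.6\) and \(b_1 = 0.2\), then the boundary is at \(-(-8.6)/0.2 = 43\) ms.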

# Getting category boundaries
IndividualCategoricalBoundaries=IndividualFits %>% 
  select(uniqueid,term,estimate) %>% 
  pivot_wider(names_from=term,values_from=estimate) %>% 
  mutate(categoricalBoundary = -`(Intercept)`/VOT)

#mean/median categorical boundary
meanCategoricalBoundary = round(mean(IndividualCategoricalBoundaries$categoricalBoundary),2)
medianCategoricalBoundary = round(median(IndividualCategoricalBoundaries$categoricalBoundary),2)
# plot(density(IndividualCategoricalBoundaries$categoricalBoundary))

#Standard deviation of categorical boundaries
sdCategoricalBoundaries = round(sd(IndividualCategoricalBoundaries$categoricalBoundary),2)

From that calculation, the mean and median categorical boundaries (the latter was mentioned earlier) were 43 ms and 43 ms respectively. We can also calculate the standard deviation of categorical boundaries across all participants (5.92 ms). We can use this information (mean and sd) to generate our predicted categorisation functions for all the participants assuming that all participants had exactly the same parameter values — that is, assuming that there is no underlying variation between participants at all.

As can be seen below, we are able to simulate the average perceptual response curve. I will say that the model seems to have less variance than the actual human participants (as shown by the steeper slope of the simulated curve). I need to think about why that is the case — maybe it’s a coding error somewhere, or maybe I have simply not accounted for other sources of variance (a very likely possibility).

DataGenerated = Data %>% 
  select(uniqueid,VOT) %>%
  left_join(IndividualCategoricalBoundaries %>% select(uniqueid,categoricalBoundary)) %>% 
  mutate(SimNum = row_number()) %>% 
  group_by(SimNum) %>% 
  # mutate(FinalPercept = RepeatedSamplerTillRunAchieved(input=VOT,boundary=categoricalBoundary,boundarySD=sdCategoricalBoundaries)) %>%
  mutate(FinalPercept = RepeatedSamplerTillRunAchieved(input=VOT,boundary=meanCategoricalBoundary,boundarySD=sdCategoricalBoundaries)) %>%
  unnest() %>% 
  ungroup() %>% 
  mutate(t_choice = ifelse(FinalPercept=="Cat2",1,0)) %>% 
  mutate(Type="Simulated")

# Plotting the average perceptual contours
Data %>% 
  mutate(Type="Actual") %>% 
  select(uniqueid,VOT,t_choice,Type) %>% 
  rbind(DataGenerated %>% select(uniqueid,VOT,t_choice,Type)) %>% 
  group_by(Type, VOT) %>%
  summarise(meanTProp = mean(t_choice)) %>%
  ggplot(aes(VOT,meanTProp,colour=Type),group=Type)+
  geom_line()

We can now see what the predicted number of iterations (our proxy for reaction times) looks like for the data. It peaks at the expected mean category boundary!

DataGenerated %>% 
  # mutate(RTnorm=scale(rt)) %>% 
  # ungroup() %>% 
  group_by(uniqueid,VOT) %>%
  summarise(MeanIterations=mean(TotalNumberOfIteration),
            MeanDResponse=mean(1-t_choice)) %>%
  ggplot(aes(VOT,MeanIterations)) +
  # geom_smooth(aes(y=MeanDResponse),se=F,method="loess")
  geom_smooth(se=F,method="loess") + 
  xlab("VOT (ms)")+
  ylab("Mean number of iterations")

Conclusion

At this point, it should be clear that there are a variety of categorical representation models that can capture crucial aspects of the results used to argue for gradient representations — parallel sampling models, serial sampling models, and parallel-serial models. These categorical models come with a host of predictions and insights into reaction time results that were previously ignored. The empirical question is, of course: which of these models best explains the data we have?

One can make a deeper argument in favour of the parallel-serial model I presented above. Why is the perceptual system both parallel and serial? Note, if the perceptual system cares about tracking any amount of ambiguity in the signal, but can only work with categorical representations, then it must have parallel sampling of the input, since the input is ephemeral. Now, that is a sort of empirical reason to model the system with an initial parallel sampler. However, a deeper theoretical reason is that the initial interface with the input signal is a one-way communication system, and not a duplex system. Parallel information transmission is great at one-way transmission (see here and here for discussion). But, parallel transmission is more complicated, generally — needs excellent synchronisation between lines, more infrastructure, …

Once we have an internal representation, then we can have the possibility of two-way communication between that initial internal representation and any other system. And serial systems excel at duplex or two-way communication. Furthermore, if you want a stable but unique representation, then parallel sampling will likely not provide it to you.

It is kinda cool that we can understand speech perception in terms that apply to networking. It is that kind of cross-fertilisation of ideas that is likely to further our understanding of biological systems. Fans of Gallistel and King (2011) should see echoes of their general plea to take the metaphor of computation and memory seriously. Here, I guess the plea is to take the theory of networking seriously. Although, I should say the last few paragraphs are just some navel-gazing right now, based on a rather general and vague understanding of the theory of networking on my part.

References

Caplan, Spencer, Alon Hafri, and John C. Trueswell. 2021. “Now You Hear Me, Later You Don’t: The Immediacy of Linguistic Computation and the Representation of Speech.” Psychological Science 32 (3): 410–23. https://doi.org/10.1177/0956797620968787.
Cowan, Nelson. 2010. “The Magical Mystery Four: How Is Working Memory Capacity Limited, and Why?” Current Directions in Psychological Science 19 (1): 51–57. https://doi.org/10.1177/0963721409359277.
Durvasula, Karthik. 2024. “Karthik Durvasula: Individual Variation?” https://karthikdurvasula.gitlab.io/posts/2024-02-14-individual-variation/.
———. 2025a. “Karthik Durvasula: Near Mergers.” https://karthikdurvasula.gitlab.io/posts/2025-04-29-Near Mergers/.
———. 2025b. “Karthik Durvasula: Perception as Sampling with Categorical Representations.” https://karthikdurvasula.gitlab.io/posts/2025-06-02-Perception as repeated sampling/.
———. 2025c. “Karthik Durvasula: Perception as Sampling with Categorical Representations. Part 2.” https://karthikdurvasula.gitlab.io/posts/2025-06-27-Perception as repeated part 2/.
Gallistel, Charles R, and Adam Philip King. 2011. Memory and the Computational Brain: Why Cognitive Science Will Transform Neuroscience. John Wiley & Sons.
Massaro, Dominic W, and Michael M Cohen. 1983. “Categorical or Continuous Speech Perception: A New Test.” Speech Communication 2 (1): 15–35.
McMurray, Bob. 2022. “The Myth of Categorical Perception.” The Journal of the Acoustical Society of America 152 (6): 3819–42.

  1. Trigger warning, literally.


Citation

For attribution, please cite this work as

Durvasula (2025, Aug. 14). Karthik Durvasula: Perception as sampling with categorical representations. Part 3. Retrieved from https://karthikdurvasula.gitlab.io/posts/2025-08-14-Perception as repeated sampling part 3/

BibTeX citation

@misc{durvasula2025perception,
  author = {Durvasula, Karthik},
  title = {Karthik Durvasula: Perception as sampling with categorical representations. Part 3},
  url = {https://karthikdurvasula.gitlab.io/posts/2025-08-14-Perception as repeated sampling part 3/},
  year = {2025}
}